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Abstract 

We introduce a novel approach for analyzing the performance of first-order black- 
box optimization methods. We focus on smooth unconstrained convex minimization 
over the Euclidean space M. d . Our approach relies on the observation that by definition, 
the worst case behavior of a black-box optimization method is by itself an optimization 
problem, which we call the Performance Estimation Problem (PEP). We formulate and 
analyze the PEP for two classes of first-order algorithms. We first apply this approach 
on the classical gradient method and derive a new and tight analytical bound on its 
performance. We then consider a broader class of first-order black-box methods, which 
among others, include the so-called heavy-ball method and the fast gradient schemes. 
We show that for this broader class, it is possible to derive new numerical bounds on 
the performance of these methods by solving an adequately relaxed convex semidefinite 
PEP. Finally, we show an efficient procedure for finding optimal step sizes which results 
in a first-order black-box method that achieves best performance. 

Keywords: Performance of First-Order Algorithms, Rate of Convergence, Complex- 
ity, Smooth Convex Minimization, Duality, Semidefinite Relaxations, Fast Gradient 
Schemes, Heavy Ball method. 



1 Introduction 

First-order convex optimization methods have recently gained in popularity both in theoret- 
ical optimization and in many scientific applications, such as signal and image processing, 
communications, machine learning, and many more. These problems are very large scale, 
and first-order methods, which in general involve very cheap and simple computational itera- 
tions, are often the best option to tackle such problems in a reasonable time, when moderate 
accuracy solutions are sufficient. For convex optimization problems, there exists an exten- 
sive literature on the development and analysis of first-order methods, and in recent years, 
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this has been revitalized at a quick pace due to the emergence of many fundamental new 
applications alluded above, see e.g., the recent collections UH] and references therein. 

This work is not on the development of new algorithms, rather it focuses on the theoretical 
performance analysis of first order methods for unconstrained minimization with an objective 
function which is known to belong to a given family T of smooth convex functions over the 
Euclidean space M. d , the function itself is not known. 

Following the seminal work of Nemirovski and Yudin [12] in the complexity analysis 
of convex optimization methods, we measure the computational cost based on the oracle 
model of optimization. According to this model, a first-order black-box optimization method 
is an algorithm which has knowledge of the underlying space M d and the family J 7 , the 
function itself is not known. To gain information on the function to be minimized, the 
algorithm queries a first order oracle, that is, a subroutine which given as input a point in 
M. d , returns the value of the objective function and its gradient at that point. The algorithm 
thus generates a finite sequence of points {xi G M. d : i = 0, . . . , N}, where at each step the 
algorithm can depend only on the previous steps, their function values and gradients via 
some rule 

x g R d , x i+1 = A(x , ...,Xi, f(x ), . . . , f(xi); f'(x Q ), /'(a*)), i = 0, . . . , N - 1, 

where /'(•) stands for the gradient of /(■). Note that the algorithm has another implicit 
knowledge, i.e., that the distance from its initial point xq to a minimizer G X*(f) of / is 
bounded by some constant R > 0, see more precise definitions in the next section. 

Given a desired accuracy e > 0, applying the given algorithm on the function / in the 
class J 7 , the algorithm stops when it produces an approximate solution x e which is e-optimal, 
that is such that 

f(x e ) - /(a:*) < e. 

The performance (or complexity) of a first order black-box optimization algorithm is then 
measured by the number of oracle calls the algorithm needs to find such an approximate 
solution. Equivalently, we can measure the performance of an algorithm by looking at the 
absolute inaccuracy 

S(f,x N ) = f(x N ) - f(x*), 

where xn is the result of the algorithm after making N calls to the oracle. Throughout this 
paper we will use the latter form to measure the performance of a given algorithm. 

Building on this model, in this work we introduce a novel approach for analyzing the 
performance of a given first order scheme. Our approach relies on the observation that by 
definition, the worst case behavior of a first-order black-box optimization algorithm is by 
itself an optimization problem which consists of finding the maximal absolute inaccuracy 
over all possible inputs to the algorithm. Thus, with x^ being the output of the algorithm 
after making N calls to the oracle, we look at the solution of the following Performance 
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Estimation Problem (PEP): 

max f(x N ) - f(x*) 
s.t. /eJ, 

x i+1 = A(x , ...,Xi] f(x ), . . . , f(xi); f'(x ), . . . , f'(xi)), i = 0, . . . , N - 1, (P) 
x* e X*(f), \\x* — x || < -R 
x , • • • , xat, x* e R d . 

At first glance this problem seems very hard or impossible to solve. We overcome this 
difficulty through an analysis that relies on various types of relaxations, including duality 
and semi-definite relaxation techniques. The problem and setting, and an outline of the 
underlying idea of the proposed approach for analyzing ((Pj) are described in Section HJ In 
order to develop the basic idea and tools underlying our proposed approach, we first focus on 
the fundamental gradient method (GM) for smooth convex minimization, and then extend 
it to a broader class of first order black box minimization methods. Obviously, the gradient 
method is a particular case of this broader class that will be analyzed below. However, 
it is quite important to start with the gradient method for two reasons. First, it allows 
to acquaint the reader in a more transparent way with the techniques and methodolgy we 
need to develop in order to analyze (PEP), thus paving the way to tackle more general 
schemes. Secondly, for the gradient method, we are able to prove a new and tight bound on 
its performance which is given analytically, see Section [3J Capitalizing on the methodology 
and tools developed in the past section, in Section HI we consider a broader class of first- 
order black-box methods, which among others, is shown to include the so-called heavy-ball 
[16] and fast gradient schemes [14] . Although an analytical solution is not available for this 
general case, we show that for this broader class of methods, it is always possible to compute 
numerical bounds for an adequate relaxation of the corresponding PEP, allowing to derive 
new bounds on the performance of these methods. We then derive in Section [5] an efficient 
procedure for finding optimal step sizes which results in a first-order method that achieves 
best performance. Our approach and analysis give rise to some interesting problems leading 
us to suggest some conjectures. Finally, an appendix includes the proof of a technical result. 

Notation. For a different iable function /, its gradient at x is denoted by fix). The 
Euclidean norm of a vector x G M. d is denoted as ||x||. The set of symmetric matrices in 
jgmxn j g denoted D y p or two symmetric matrices A and B, A y B, (A y B) means 
A — B y (A — B y 0) is positive semidefinite (positive definite). We use e, for the i-th 
canonical basis vector in M. N , which consists of all zero components, except for its i-th entry 
which is equal to one, and use v to denote a unit vector in M. d . For an optimization problem 
(P), val(P) stands for its optimal value. 
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2 The Problem and the Main Approach 



2.1 The Problem and Basic Assumptions 

Let A be a first-order algorithm for solving the optimization problem 

(M) min{/(x) : x G R d }. 
Throughout the paper we make the following assumptions: 

• / : M. d — y R is a convex function of the type C^ 1 (M d ), i.e., continuously differentiable 
with Lipschitz continuous gradient: 

\\f'(x)-f'(y)\\<L\\x-y\\, Wx,yeR d , 

where L > is the Lipschitz constant. 

• We assume that (M) is solvable, i.e., the optimal set X*(f) := argmin/ is nonempty, 
and for x* G X*(f) we set /* := /(x*). 

• There exists R > 0, such that the distance from xq to an optimal solution x* G X^f) 
is bounded by 

Given a convex function / in the class C^' 1 (R d ) and any starting point xq G R d , the 
algorithm A is a first-order black box scheme, i.e., it is allowed to access / only through the 
sequential calls to the first order oracle that returns the value and the gradient of / at any 
input point x. The algorithm A then generates a sequence of points X{ G R d , % — 0, . . . , N. 



2.2 Basic Idea and Main Approach 

We are interested in measuring the worst-case behavior of a given algorithm A in terms of 
the absolute inaccuracy }\xn) — /(#*), by solving problem ([P]) defined in the introduction, 
namely 

max f(x N ) - 

s.t. / G Ci a (R d ), / is convex, 

x i+1 = A(x , . . . , Xi, f(x ), . . . , f(xi); f(x ), . . . , f'(xi)), i = 0, . . . , N - 1, (P) 

G - x || < R, 

Xq, ■ ■ ■ , Xat, G R d . 

To tackle this problem, we suggest to perform a series of relaxations thereby reaching a 
tractable optimization problem. 

A main difficulty in problem (JP]) lies in the functional constraint (/ is a convex function 
in C^ 1 (M d )), i.e., we are facing an abstract hard optimization problem in infinite dimension. 

general, the terms L and R are unknown or difficult to compute, in which case some upper bound 
estimates can be used in place. Note that all currently known complexity results for first order methods 
depend on L and R. 
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To overcome this difficulty, the approach taken in this paper is to relax this constraint so 
that the problem can be reduced and formulated as an explicit finite dimensional problem 
that can eventually be adequately analyzed. 

An un-formal description of the underlying idea consists of two main steps as follows: 

• Given an algorithm A that generates a finite sequence of points, to build a problem in 
finite dimension we replace the functional constraint / £ C^' 1 in ([P]) by new variables in 
M. d . These variables, are the points {xq, } themselves, the function values 
and their gradients at these points. Roughly speaking, this can be seen as a sort of 
discretization of / at a selected set of points. 

• To define constraints that relate the new variables, we use relevant /useful properties 
characterizing the family of convex functions in C^' 1 , as well as the rule(s) describing 
the given algorithm A. 

This approach can, in principle, be applied to any optimization algorithm. Note that 
any relaxation performed on the maximization problem ([P]) may increase its optimal value, 
however, the optimal value of the relaxed problem still remains a valid upper bound on 

f(x N ) - f*. 

A formal description on how this approach can be applied to the gradient method is 
described in the next section, which as we shall see, allows us to derive a new tight bound 
on the performance of the gradient method. 



3 An Analytical Bound for the Gradient Method 

To develop the basic idea and tools underlying the proposed approach for analyzing the 
performance of iterative optimization algorithms, in this section we focus on the simplest 
fundamental method for smooth convex minimization, the Gradient Method (GM). It will 
also pave the way to tackle more general first-order schemes as developed in the forthcoming 
sections. 

3.1 A Performance Estimation Problem for the Gradient Method 

Consider the gradient algorithm with constant step size, as applied to problem (M), which 
generates a sequence of points as follows: 

Algorithm GM 

0. Input: / £ Ci\R d ) convex, x £ R d . 

1. For i — 0, . . . , N — 1, compute x i+ i = Xi — jf'(xi). 



Here h > is fixed. At this point, we recall that for h — 1, the convergence rate of the 
GM algorithm can be shown to be (see for example [T%1 H]): 

f(xN)-r< Lllx °~ x4 \ V*„e *,(/). (3.1) 
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To begin our analysis, we first recall a fundamental well-known property for the class of 
convex C^ 1 functions, see e.g., [Ml Theorem 2.1.5]. 

Proposition 3.1. Let f : M d — > E be convex and C£ . Then, 

^ll/'0*0 - f'iv) II 2 < /(*) - f(v) ~ <f(v),z-v), for allx,y G R d . (3.2) 

Let x G M. d be any starting point, let {x±, . . . , x^} be the points generated by Algorithm 
(GM) and let x* be a minimizer of /. Applying (13.21) on the points {xo, . . . , xat,x*}, we get: 

^ll/'fo) - f(^)W 2 < /(^) - /(^) - (/'fo),*< - '•./ (' V.*. (3.3) 



Now define 



#i := yr, 7^(f(Xi) - f(x,)), i = 0,...,JV,* 

Iv||X* — Xo || 

Pi : = 7|i — " uf'( x i), i = 0,...,N,* 

-L\\X* — Xo || 

and note that we always have <5* = and g* = 0. 
In terms of Si, gi, condition (13. 3 j) becomes 

lll^-^-f <^-^- ( f z,j = 0,...,iV,*, (3.4) 

Z ||X* — Xo|| 

and the recurrence defining (GM) reads: 

Xj+i = x.j — h\\x* — Xo\\gi, i = 0, . . . , N — 1. 

Problem ([P]) can now be relaxed by discarding the underlying function / G C^' 1 in 
(jP|) . That is, the constraint in the function space / G C^' 1 with / convex, is replaced by 
the inequalities (13 .4p characterizing this family of functions and expressed in terms of the 
variables x , . . . , Xn, x* G M. d , g , . . . , g N G ~R d and So, . . . , Sn G M generated by (GM). Thus, 
an upper bound on the worst case behavior of /(xjv) ~~ /O^*) = — Xo|| 2 5jv can be 

obtained by solving the following relaxed PEP: 

max L||x* — xq|| 2 5v 



g ,-,g N ^ d 
S ,...,5 N eR 



s.t. Xj + i = Xj — fo||x* — xo||<7i, i — 0, . . . , N — 1, 

1 II 1 1 2 • £• s- \9ji %i ■^7/ ■ ■ n at 

2 ||x*-x || 
1 1 a?* — x o\\ < -R- 
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Simplifying the PEP The obtained problem remains nontrivial to tackle. We will now 
perform some simplifications on this problem that will be useful for the forthcoming analysis. 

First, we observe that the problem is invariant under the transformation g\ 4— Qgi, 
x\ <— Qxi for any orthogonal transformation Q. We can therefore assume without loss of 
generality that x* — xq = \\x* — Xo\\v, where v is any given unit vector in M. d . Therefore, for 
i = * the inequality constraints reads 

<o*-6j J - n n j=0,...,N. 

Secondly, we consider H3.4[) for the four cases i — *, j — *, i < j and j < i, and use the 
equality constraints 

x i+ i = Xi - h\\x m - x \\g h i = 0, . . . , N - 1 

to eliminate the variables x\, . . . , x/v- After some algebra, we reach the following form for 
the PEP: 

max L\\x* — Xo\\ 2 5]y 

xo,x*,gieR d ,8ieR 



1 3 

s-t. 2 hi - 9j\\ 2 <Si- 6, - (gj, h 9t-i), i < j = 0, ■ ■ ■ ,N. 

t=i+l 

1 - 

7}\\9i - 9j\\ 2 <&%- Sj + (gj, Y h 9t-i), j < i — 0, . . . ,N 
\9if<6i, i = 0,...,N. 



t=j+i 

2 1 

1 ' 

dN| 2 < -Si - (g^y + ^2hgt-i), i = 0,...,N, 
\\x* - x Q \\ < R, 

where i < j = 0, . . . , N is a shorthand notation for % = 0, . . . , N — 1, j = % + 1, . . . , N. 

Finally, we note that the optimal solution for this problem is attained when Xq\\ = R, 
and hence we can also eliminate the variables Xq and x*. This produces the following PEP 
for the gradient method, a nonconvex quadratic minimization problem: 

max LR 2 5n 

1 3 

s -t- ~~ g j\\ 2 <Si- 8j - (gj, Y h 9t-i), i < j = 0, . . . , N, 

t=i+l 

1 - 

^hi - 9j\\ 2 <Si- Sj + (gj, Y h 9t-i), j <i = 0,...,N, 

t=j+i 

^\\gif<Si, i = 0,...,N, 
1 i 

2 \\9i\\ 2 < Si ~ (9i, U + h 9t-i)> i = 0,...,N. 

t=i 
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This problem can be written in a more compact and useful form. Let G denote the 
(N + 1) X d matrix whose rows are . . . gjj, and for notational convenience let Uj G M. N+1 
denote the canonical unit vector 

Ui = e i+u i = 0, . . . , N. 

Then for any we have 

gt = G T u h ti(G T u i u T j G) = and (G T u i ,v) = (g t ,v). 

Therefore, by defining the following (N + 1) x (N + 1) symmetric matrices 

j 



1 1 j 

A i,j = ~ u j)i u i - u j) T + o 5^ h ( U 3 U t-l + u t-iuj), 



t=i+l 
i 



t=j+i (3.5) 



1 rp 

Ci — 2 U i u i j 



1 1 % 

Di = -UiuJ + ~Y^ h (uiuJ_± + ut-iuj), 



2 

t=i 

we can express our nonconvex quadratic minimization problem in terms of 5 := (So, ■ ■ ■ , 5n) £ 
]R Ar+1 and the new matrix variable G G ]R( Ar+1 ) xd as follows 

max LR 2 Sn 

GeR( N + 1 ) xd ,Sm N+1 

s.t. tr(G T AijG) ^Si-Sj, i < j = 0, . . . , N, 

tr(G T B h:j G) <5i-5j, j < i = 0, . . . , N, ( G ) 
tx{G T C i G)<8 i , 2 = 0,..., N, 
ti(G T DiG + uujG) <-8i, i = 0,..., N. 

Problem (G) is a nonhomogeneous Quadratic Matrix Program, a class of problems recently 
introduced and studied by Beck [3]. 



3.2 A Tight Performance Estimate for the Gradient Method 

We now proceed to establish the two main results of this section. First, we derive an upper 
bound on the performance of the gradient method, this is accomplished via using duality 
arguments. Then, we show that this bound can actually be attained by applying the gradient 
method on a specific convex function in the class C^' 1 . 

In order to simplify the following analysis, we will remove some constraints from ([G]) and 
consider the bound produced by the following relaxed problem: 

max LR 2 5n 

GeK( JV + 1 > xd ,<5eR JV + 1 

s.t. tr(G T A-i,iG) < - 5,, i = l,...,N, ( G ') 
tr(G T DiG + vujG) < -Si, i = 0, . . . , N. 
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As we shall show below, it turns out that this additional relaxation has no damaging effects 
and produces the desired performance bound when < h < 1. 



We are interested in deriving a dual problem for (JCJ) which is as simple as possible, 
especially with respect to its dimension. As noted earlier, problem ( ]Cl) is a nonhomogeneous 
quadratic matrix program, and a dual problem for (jCl) could be directly obtained by applying 
the results developed by Beck [3|. However, the resulting obtained dual will involve an 
additional matrix variable $ G E> d , where d can be very large. Instead, here by exploiting the 
special structure of the second set of nonhomogeneous inequalities given in (jG 7 !) . we derive 
an alternative dual problem, but with only one additional variable t G R. 

To establish our dual result, the next lemma shows that a dimension reduction is possible 
when minimizing a quadratic matrix function sharing the special form as the one that appears 
in problem (JCj. 

Lemma 3.1. Let f(X) = tr(X T QX + 2ba T X) be a quadratic function, where X G R nxm , 
Q G § n ; a G W 1 and ^ b G W m . Then 

inf f(X) = inf f(£b T ). 

xm nxm £eR™ 

Proof. First, we recall (this can be easily verified) that inf{/(X) : X G IR nxm } > — oo if and 
only if Q >z 0, and there exists at least one solution X such that 

QX + ab T = <£> X T Q + ba T = 0, (3.6) 

i.e., the above is just Vf(X) = and characterizes the minimizers of the convex function 
f(X). Using (USD it follows that inf x f(X) = f(X) = tr(ba T X). Now, for any £ G W\ we 
have f(£b T ) = ||6|| 2 (£ T Q£ + 2a T £). Thus, likewise, inf{/(£6 T ) : f G M n } > -oo if and only 
if Q y and there exists (6l" such that 

Q£ + a = 0, (3.7) 

and using flST]) it follows inf c /(^ T ) = /(^6 T ) = ||&|| 2 a T ^ = tr(ba T £b T ). Now, using M - 
(13. 7p . one obtains A T Q = — 6a T and Q(A — £5 T ) = 0, and hence it follows that 

f(X)-f& = tr(ba T (X-£b T )) 

= tr(-X T Q(X - lb T ) = 0. 

□ 



Equipped with Lemma [3. 1[ we now derive a Lagrangian dual for problem (JGj. 

Lemma 3.2. Consider problem (jG 7 } for any fixed h 6l and L, R > 0. ^4 Lagrangian dual 
of ( jG 7 ]) ^iven 6?/ i/ie following convex program: 

min {^Li? 2 t : A G A, S(X,t) >z 0}, (DC) 
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where A := {A G R N : X i+ i-Xi > 0, i = 1, . . . , N-l, 1-Ajv > 0, Aj > 0, i = l,...,JV}, 
the matrix S(-, •) G E> N+2 is given by 



S{X,t) 



1 — /i)^o + ^"5*1 Q 

,T + 



q 



with q = (Ai, A2 — Ai, . . . , Ajv — Ayv_i, 1 — Ajv) t and where the matrices So, Si G S N+1 are 
defined by: 

/2Ai -Ai \ 
-Ai 2A2 — A2 



and 



So 



V 

2Ai 
A2 — Ai 



S*, 



-A2 2A3 —A3 

— Ajv-i 2Ajv -Ajv 
-Ajv 1 / 



A2 — Ai ... Ajv — Ayv— 1 1 ~~ Ajv^ 
2A2 Ajv — Ajv-i 1 — Ajv 



(3i 



2Ajv 1 — Ajv 
1- Ajv 1 



(3.< 



Ajv — Ajv-i Ajv — Ajv-i 
y 1 — Ajv 1 — Ajv 

Proof. For convenience, we recast (jCEJ as a minimization problem, and we also omit the 
fixed term LR 2 from the objective. That is, we consider the equivalent problem ( 1G"|) defined 
by 



min — <5jv 

GeR( JV + 1 ) xd ,<5eiR JV + 1 



s.t. tv(G T A i _ hi G) < - 5,, i = 1, . . . , N, 
tr(G T DiG + uujG) < -Si, i = 0, . . . , N. 

Attaching the dual multipliers A = (Ai, . . . , Ajv) G IR+ and r := (r , . . . , Tjv) t G 



(G") 



9JV+1 



to 



the first and second set of inequalities respectively, and using the notation 5 = (S , . . . , <5jv), 
we get that the Lagrangian of this problem is given as a sum of two separable functions in 
the variables (S,G): 



N 



JV 



L(G,5,X,t) = -6 N + XjjSj - Sj-x) + ^ TjSj 



i=l 



N 



i=0 
N 



+ X * tr(G T A z „ M G) + J2 T i {^(G T D t G + vujG)) 

i=l i=0 

= L 1 («J,A,r) + L 2 (G,A,r). 
The dual objective function is then defined by 

H(X, t) = minL(G, 5, X r) = min Li(S, A, r) + min L 2 (G, A, r), 

G,5 5 G 
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and the dual problem of (1G"|) is then given by 

max{iJ(A,r) : AGtf.rG 
Since L±(-, A, r) is linear in 6, we have min,? Li(5, A, r) = whenever 

0, 



(DG") 



- Ai + T 

Aj — Aj + i + Tj 

-1 + Ayyr + T N 



(i = l,...,JV-l) 
0, 



(3.10) 



and — oo otherwise. Invoking Lemma [3. 11 we get 



min L 2 (G, A,r) = min L 2 (wu ,A,r). 

G G R(AT+l)xd U>6I8 JV + 1 

Therefore for any (A, r) satisfying (13.101) . we have obtained that the dual objective reduces 
to 



A' 



A 



H(X,t) 



mm 

w£R N + 



S wT Yl X * A *-^ + Yl TiDi ) w + tT%v } 



,i=i 



8=0 



max{ — t : w T 

teM 2 



'AT AT \ 

K i=l i=0 J 



1 



-~t, \/w G R N+1 } 



K . I E£i Mi-i,i + Eto r * A l r 



A' 



max <{ — t : 
2 



2 



^0 . 



where the last equality follows from the well known lemma [6j Page 163 jEl- 

Now, recalling the definition of the matrices Aj-i^, -Dj (see (13. 5p ). we obtain 

/ A x (fc - l)Ai 
(fc-l)Ai Ai + A 2 (h-l)X 2 

(h-l)X 2 A 2 + A 3 (/i-l)A 3 



A 



i— l,i — „ 



i=l 



and 



* 1 



i=0 



(h-l)X N -i Xn-i + X n (Zi-l)Ajv 
(/i - 1)Ajv Aat / 



hTN -l hTN\ 
hTN— I ^Tn 



hT N -\ hT N -l T N -1 hTN 

\ hT N hT N ■ ■ ■ hT N T N j 

Finally, using the relations (I3.10p to eliminate 7$, and recalling that val( lG"l) was defined as 
—LR 2 val ljG 7 ]) . the desired form of the stated dual problem follows. □ 



2 Let M be a symmetric matrix. Then, x T Mx + 2b T x + c > 0, Vie 6 K d if and only if the matrix 
is positive semidcfinitc. 



'M 

,b T CJ 
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The next lemma will be crucial in invoking duality in the forthcoming theorem. The 
proof for this lemma is quite technical and appears in the appendix. 

Lemma 3.3. Let 

i 

Xi = , % = 1, . . . , N, 

2N + 1-V 

then the matrices So, Si G E> N+1 defined in A3. <§j) -/ TOj) are positive definite for every iVeN. 

We are now ready to establish a new upper bound on the complexity of the gradient 
method for values of h between and 1. To the best of our knowledge, the tightest bound 
thus far is given by (13.11) . 

Theorem 3.1. Let f G C^' 1 (M a! ) and let Xo,...,% 6 I be generated by Algorithm GM 
with < h < 1. Then 

T D2 

f(x N ) ~ /(*.) < iMT 2- (3.11) 

Proof. First note that both (G) and (jG 7 !) are clearly feasible and val(G) < val ( jG 7 !) . Invoking 
Lemma [3.2[ by weak duality for the pair of primal-dual problems f lG 7 ]) and (1DG'|) . we thus 
obtain that val (JG 7 } < val ( IDG 7 }) and hence: 



f(x N ) - f* < val(G) < val ((GO) < val flDGl . (3.12) 
Now consider the following point (X,t) for the dual problem (1DG'|) : 



Aj = , i = 1, . . . , N, 

2N+1-Z 1 ' ' ' 

1 

t 



2Nh+ 1 

Assuming that this point is (jDGjJ-feasible, it follows from (13. 12ft that 

which proves the desired result. Thus, it remains to show that the above given choice (A, t) 
is feasible for ( IDG 7 !) . First, it is easy to see that all the linear constraints of ( |DG ; |) on the 
variables Xi,i = 1, . . . , N described through the set A hold true. Now we prove that the 
matrix S = S(X,t) is positive semidefinite. From Lemma |3.3[ with h G [0,1], we get that 
(1 — h)So + hS\ is positive definite, as a convex combination of positive definite matrices. 
Next, we argue that the determinant of S is zero. Indeed, take u :— (1, . . . , 1, —(2Nh + 1)) T , 
then from the definition of S and the choice of Aj and t it follows by elementary algebra that 
Su = 0. To complete the argument, we note that the determinant of S can also be found 
via the identity (see, e.g., [TJ Section A. 5. 5]): 

det(S) = (t- q T {(l - h)S + hS^q) det((l - h)S + hSJ. 

Since we have just shown that (1 — h)So + hS\ y 0, then det((l — h)So + hS\) > and we 
get from the above identity that the value of t — q T ((l — h)So + /iS'i)" 1 g, which is the Schur 
complement of the matrix S, is equal to 0. By a well known lemma on the Schur complement 
0, Lemma 4.2.1], we conclude that S is positive semidefinite. □ 
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The next theorem gives a lower bound on the complexity of Algorithm GM for all values 
of h. In particular, it shows that the bound ( 13. lip is tight and that it can be attained by a 
specific convex function in C^' 1 . 

Theorem 3.2. Let L > 0, N G N and d G N. Then for every h > there exists a convex 
function tp G C^' 1 (lR d ) and a point xq G M. d such that after N iterations, Algorithm GM 
reaches an approximate solution x^ with following absolute inaccuracy 

LR 2 ( ,„ , n2at 



p(xn) — V* = max ( , (1 — h) 

Proof. We will describe two functions that attain the two parts of the max expression in 
the above claimed statement. For the sake of simplicity we will assume that L — 1 and 
R = II 1. Generalizing this proof to general values of L and R can be done by an 

appropriate scaling. 

To show the first part of the max expression, consider the function 



i ii.ii i > 



2JVh+l II^H 2(27V/i+l) 2 11*11 — 2Nh+l 

Ux\\ 2 llxll < 1 



2 11*11 11*11 ^ 2Nh+\ 

Note that this function is nothing else but the Moreau proximal envelope of the function 
2n\+i \\ x \\i El - ^ * s we ^ known that this function is convex, continuously different iable with 
Lipschitz constant L = 1, and that its minimal value ip(x*) = 0, see e.g., [TTJ [17]. Applying 
the gradient method on y?i(x) with x = v where, as before, v is a unit vector in IR d , we 
obtain that for % — 0, . . . , N: 

i / \ 1 (-, ih \ 

(Pi(Xi) = V\ X; — 1 V, 

yiV ; 2Nh + l ' V 2Nh + l) ' 



and (fi\(xi 



If hi \ 1 



2Nh + l\ 2Nh + lJ 2(2Nh + iy 
1 /4Nh + 1 - 2hf 



Therefore, 



ANh + 2 \ 2Nh + 1 



<Pi(x N ) - tpi(x<,) = fx{x N ) 



ANh + 2 

To show the second part of the max expression, we apply the gradient method on 

V2{x) = ^\\x\\ 2 
with Xq = v. We then get that for i — 0, . . . , N: 



if' 2 (xi) = Xi) Xi = (1 - h) l v\ ip 2 (xi) = -(1 - h) 2t 



and hence 

ip 2 {x N ) ~ (f2{x*) = <P2(X N ) = -A} ~ h) 2N 

and the desired claim is proven. □ 
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Numerical experiments we have performed on problem (1GT) suggest that the lower com- 
plexity bound given by Theorem 13.21 is in fact the exact complexity bound on the gradient 
method with < h < 2. 

Conjecture 3.1. Suppose the sequence xq, . . . ,xn is generated by the gradient method GM 
with < h < 2 , then 

T C)2 / 1 

f(x N ) ~ fix,) < — -max , (1 - h) 2N 

2 \2Nh + l ,K 1 

We now turn our attention to the problem of choosing the step size, h. Assuming the 
above conjecture holds true, the optimal step size (i.e., the step size that produces the best 
complexity bound) for the gradient method with constant step size is given by the unique 
positive solution to the equation 

= (l-h) 2N 

2Nh+l 1 ] ' 

The solution of this equation approaches 2 as iV grows. Therefore, assuming the conjecture 
holds true and N is large enough, the complexity of the gradient method with the optimal 
step size approaches 8 ^/^ 2 . This represents a significant improvement, by the factor of 4, 
upon the best known bound on the gradient method ( 13. ip . and also supports the observation 
that the gradient method often performs better in practice than in theory. 

4 A Class of First-Order Methods: Numerical Bounds 

The framework developed in the previous sections will now serve as a basis to extend the 
performance analysis for a broader class of first-order methods for minimizing a smooth 
convex function over M d . First, we define a general class of first-order algorithms (FO) 
and we show that it encompasses some interesting first-order methods. Then, following our 
approach, we define the corresponding PEP associated with the class of algorithms (FO). 
Although for this more general case, an analytical solution is not available for determining 
the bound, we establish that given a fixed number of steps N, a numerical bound on the 
performance of (FO) can be efficiently computed. We then illustrate how this result can be 
applied for deriving new complexity bounds on two first-order methods, which belong to the 
class (FO). 

4.1 A General First-Order Algorithm: Definition and Examples 

As before, our family J 7 is the class of convex functions in C£ and {d,N,L,R} are 

fixed. Consider the following class of first-order methods: 



Algorithm FO 

0. Input: / G C^ 1 (M d ), x G R d . 



1. For i = 0, . . . , N - 1, compute x i+1 = x { - ~ Yll=o h t +1) f'{ x k) 
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Here, n£ E R play the role of step-sizes, which we assume to be fixed and determined 
by the algorithm. 

The interest in the analysis of first-order algorithms of this type is motivated by the 
fact that it covers some fundamental first-order schemes beyond the gradient method. In 
particular, to motivate (FO), let us consider the following two algorithms which are of 
particular interest, and as we shall see below can be seen as special cases of (FO). 

We start with the so-called Heavy Ball Method (HBM). For earlier work on this method 
see Polyak [16], and for some interesting modern developments and applications, we refer 
the reader to Attouch et al. [21 I] and references therein. 

Ex ample 4.1. The Heavy Ball Method (HBM) 

Algorithm HBM 

0. Input: / G C£ 1 (M d ), x E R d , 

1. Xi <r- X ~ f f(xo) 

2. For i = 1, . . . , N — 1 compute: x i+ i = x^ — jf'(xi) + f3(xi — Xj-i) 

Here the step sizes a and (3 are chosen such that < /3 < 1 and < a < 2(1 + /3), see 

m- 

By recursively eliminating the term X{ — X{-\ in step 2 of HBM, we can rewrite this step 

as 

1 * 

x l+1 = x l - I Yl <*P l - k f\x k ), i = 1, . . . , N - 1. 

fc=0 

Therefore, the heavy ball method is a special case of Algorithm FO with the choice 

hl +1) = a ^- k , k = 0, . . . , i, i = 0, . . . N - 1. 

The next algorithm is Nesterov's celebrated Fast Gradient Method |13j. 

Example 4.2. The fast gradient method (FGM) 
Algorithm FGM 

0. Input: / G C^ 1 (R d ), x E R d , 

1. y x <- x , ti <- 1, 

2. For % = 1, . . . , N compute: 

(a) Xi i yi J/'O/i), 

(b) u» - 

(c) Vi+1 <~ Xi + J^iXi - Xi_x). 
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A major breakthrough was achieved by Nesterov in [13], where he proved that the FGM, 
which requires almost no increase in computational effort when compared to the basic gra- 
dient scheme, achieves the improved rate of convergence 0(1/N 2 ) for function values. More 
precisely, one hasE 

f(x N )-r< 2L ^°~^f , V*,e *,(/). (4.1) 

The order of complexity of Nesterov's algorithm is also optimal, as it is possible to show that 
there exists a convex function / G C x £ (R d ) such that when d > 2N + 1, and under some 
other mild assumptions, any first-order algorithm that generates a point x^ by performing 
N calls to a first-order oracle of / satisfies [H| Theorem 2.1.7] 

/(^)-r> 3 ^;;*f . v*. e *.</). 

This fundamental algorithm discovered about 30 years ago by Nesterov [13] has been recently 
revived and is currently subject of intensive research activities. For some of its extensions 
and many applications, see e.g., the recent survey paper Beck-Teboulle [5] and references 
therein. 

At first glance, Algorithm FGM seems different than the scheme FO defined above. 
Here two sequences are defined: the main sequence x ,...,xn and an auxiliary sequence 
yi, . . . ,Un- Observing that the gradient of the function is only evaluated on the auxiliary 
sequence of points {yi}, we show in the next proposition that FGM fits in the scheme FO 
through the following algorithm: 



Algorithm FGM' 

0. Input: / G C^ 1 (R d ), x G R d , 

1. yi «- x , 

2. For % = 1, . . . , N — 1 compute: 

(a) y i+1 <- y» - £ El=i h< k +1) f(yk), 

3. x N <r- y N - t/'Gmt), 



with 




K r = <«i-l) fc = t-l, (z = l,...,A-l, fc = 1 0- ( 4 -2) 



and 



z = 1 



^ — Z > 1. 



3 The bound given here, which improves the original bound derived in j!3| . was recently obtained in 
Beck-Teboulle g]. 
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Proposition 4.1. The points y\, ■ ■ ■ ,Vn, x n generated by Algorithm FGM are identical to 
the respective points generated by Algorithm FGM. 

Proof. We will show by induction that the sequence generated by Algorithm FGM' is 
identical to the sequence y\ generated by Algorithm FGM, and that the value of a; at generated 
by Algorithm FGM' is equal to the value of x^ generated by Algorithm FGM. 

First note that the sequence tj is defined by the two algorithms in the same way. Now let 
{xi, y{\ be the sequences generated by FGM and denote by {y'j}, x' N the sequence generated 
by FGM'. Obviously, y[ = y\ and since t% — 1 we get using the relations 14.21 

V2 = V'l - ^4V(l/i) = Vl - 1 fl + t -^j f{y x ) = Vl - I/'fo) =x l = y 2 . 
Assuming y[ = yi for % = 1, . . . , n, we then have 

k=l 

= Vn~ y(l + tj f 1 )f{Vn) ~ " l)f(yn-l) ~ J £ f { V ' k ) 

= x B + ^ (-7/'^) + ^/'(yn-i) + y'n - y'n.^j 

, tn ~ 1 / \ 

X n , ~r — 7 \%n •^n—l) 

= y n +i- 

Finally, 

x 'n =Vn- Tf'iVN) = Vn~ -j^fXyN) = x N . 

□ 

4.2 A Numerical Bound for FO 

To build the performance estimation problem for Algorithm FO, from which a complexity 
bound can be derived, we follow the approach used to derive problem (|G|) for the gradient 
method. The only difference being that here, of course, the relation between the variables Xi 
is derived from the main iteration of Algorithm FO. After some algebra, the resulting PEP 
for the class of algorithms (FO) reads 

max LR 2 Sn 

GeM( Ar + 1 ) xd ,<5 l eiR 

s.t. tv(G T A hj G) < Si -Sj, i < j = 0, . . . , N, 

ti (G T BijG) < 5 t - Sj, j<i = 0,...,N, (Q) 
tx(G T CiG) < 5i, i = 0,...,N, 
tr(G T D l G + uujG) < -St, i = 0,...,N, 
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where A it j, B it j, Ci and A are defined, similarly to (I3.5p . by 

3 t-1 



1 1 . ^ ^ ^ 



t=i+l k=0 



i t-1 



Bij = \{Ui~ Uj)(Ui - Uj ) T - - h t\ u 3 u l + u kuJ), 

t=j+i fc=o (4.3) 

1 T 

1 i i t-1 

A = 2 MiM ^ + 2 ^ ( UiU * + 

t=l fc=0 

and we recall that ^ G M d is a given unit vector, it, = e^+i G IR^" 1 " 1 and the notation 
% < j = 0, . . . , N is a shorthand notation for % — 0, . . . , N — 1, j — i + 1, . . . , N. 

In view of the difficulties in the analysis required to find the solution of (RjI) . an analytical 
solution to this more general case seems unlikely. However, as we now proceed to show, we 
can derive a numerical bound on this problem that can be efficiently computed. 

Following the analysis of the gradient method, (cf. (fG 7 ]) in §3.2[) we consider the following 
simpler relaxed problem: 

max LR 2 Sn 

G6R( JV + 1 ) xd ,d i 6K 

s.t. tr(G T i i _i,<G) < Si-i -St, i = 1, . . . , N, (Q') 
ti{G T DiG + uujG) < —5i, i = 0,...,N. 



With the same proof as given in Lemma 13.21 for problem (QM, we obtain that a dual 
problem for ( Q' ) is given by the following convex semidefinite optimization problem: 



min —t 

\,T,t 2 



s.t. 



2 ' 



(A, r) G A, 



where 
A = {(A,r)G 



Ai, Aj-Ai+i + Ti = 0, i 



^ 0, 



(DQO 



N-l, X n + t n = 1}. (4.4) 



Note that the data matrices of both primal-dual problems (Q') and (DQ') depend on 
the step-sizes h% . To avoid a trivial bound for problem (Q'), here we need the following 
assumption on the dual problem (DQ'): 

Assumption 1 Problem ( DQ' ) is solvable, i.e., the minimum is finite and attained for 
the given step-sizes . 

Actually, the attainment requirement can be avoided if we can exhibit a feasible point 
(A, r, t) for the problem (DQ'). As noted earlier, given the difficulties already encountered for 
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the simpler gradient method, finding explicitly such a point for the general class of algorithms 



(FO) is unlikely. However, the structure of problem (DQ') will be very helpful in the analysis 
of the next section which further addresses the role of the step-sizes. 

The promised numerical bound now easily follows showing that a complexity bound for 



(FO) is determined by the optimal value of the dual problem (DQ') which can be computed 
efficiently by any numerical algorithm for SDP (see e.g., [BJ QUI HSj)- 



Proposition 4.2. Fix any N,d G N. Let f G C^' 1 (IR a! ) be convex and suppose that xq, 
M. d are generated by Algorithm FO, and that assumption 1 holds. Then, 



x N G 



f(x N )-r <£# 2 val(DQ') 



Proof. Follows from weak duality for the pair of primal-dual problems (QM-(DQ') 



□ 



4.3 Numerical Illustrations 

We apply Proposition 14. 21 to find bounds on the complexity of the heavy ball method (HBM) 
with_| a = 1 and j3 = \ and on the fast gradient method (FGM) with as given in (14. 2p . 
which as shown earlier, can both be viewed as particular realizations of (FO). 

The resulting SDP programs were solved for different values of N using CVX [9j [8] . These 
results, together with the classical bound on the convergence rate of the main sequence of 
the fast gradient method (14. ip . are summarized in Figures [T] and [2J 



Heavy ball method, a=1 p-0.5 ; 
FGM - main sequence 
FGM - auxiliary sequence 
Known bound on FGM (4.1) 





5 10 15 

N 

Figure 1: The new bounds on the heavy ball and fast gradient methods. 

Note that as far as the authors are aware, there is no known convergence rate result for 
the HBM on the class of convex functions in C£ . As can be seen from the above results, the 
numerical bound for HBM behaves slightly better than the gradient method (compare with 

4 According to our simulations, this choice for the values of a, ft produce results that are typical of the 
behavior of the algorithm. 
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Figure 2: The new bounds for HBM and FGM versus the classical bound on Nesterov's 
algorithm. 



the explicit bound given in Theorem I3.ip . but remains much slower than the fast gradient 
scheme (FGM). 

Considering the results on the FGM, note that the numerical bounds for the main se- 
quence of point Xi and the corresponding values at the auxiliary sequence yi of the fast 
gradient method are very similar and perform slightly better than predicted by the classical 
bound (I4.1I) . To the best of our knowledge, the complexity of the auxiliary sequence is yet 
unknown, thus these results encourage us to raise the following conjecture. 

Conjecture 4.1. Let xq,Xi, . . . and j/i, f/2> • • ■ be the main and auxiliary sequences defined 
by FGM (respectively), then {f(xi)} and {f(yi}} converge to the optimal value of the problem 
with the same rate of convergence. 

5 A Best Performing Algorithm: Optimal Step Sizes 
for The Algorithm Class FO 

We now consider the problem of finding the "best" performing algorithm of the form FO 



with respect to the new bounds. Namely, we consider the problem of minimizing val (0/), the 
optimal value of (Q'), with respect to the step sizes h := (h^) < k<i < N defining the algorithm 



FO, and which are now considered as unknown variables in FO. 

We denote by Aij(h) and Di(h), the matrices given in (14. 3p . which are functions of the 
algorithm step sizes h. The resulting bound derived in Proposition 14.21 is thus a function of 



h, and the problem of minimizing val (DQ') with respect to the step sizes h thus consists of 
solving the following bilinear problem: 

min (L : fEi.M,-MW+E£.nAW H t0 ,(A,r )e A), (BIL, 

h,\,r,t 2 \ ttT nt / 
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with A defined as in (I4.4p . 

Note that the feasibility of (IBIL[) follows from the proof of Theorem 13. 2\ where an explicit 
feasible point is given to (IDG')) , which is a special instance of (IBIL[) when the steps (h^) 
are chosen as in the gradient method. 

From the definition of the matrices Aij(h) and Di(h), we get 

N N N „ N 



XjAi-i t i(h) +y^TiDi(h) = ^^Aj(ui_i -Ui)(ui-i - Uif + \ TjUjuJ 

i=l i=0 i=l i=0 

1 N i-1 / i \ 

+ \ E E + ^ E ^ + 



i=l k=0 \ t=k+l 

Introducing the new variables: 



, k = \ i hf+r i £ hf, i = l,...,N, fc = 0,...,i-l (5.i; 



t=fe+i 



and denoting r = (ri ik )o<k<i<N) we obtain the following linear SDP relaxation of (1BIL|) : 

min (it : r) H ^ 0, (A, r) G A ) , (LIN) 



r,A.r,t t 2 V V ¥■ 

where 

j N N N i-1 

S(r, A,t) = Xi{ui-i - Ui){ui^i - Ui) T + - ^TiUiuf + ^EE 7 *-* + u k uj). 

i=l i=0 i=l k=0 

This convex SDP can now be efficiently solved by numerical methods. As the following 
theorem shows, its solution can be used to construct a solution for fIBILI) with optimal step 
sizes h. 



Theorem 5.1. Suppose (r*,X*,r*,t*) is an optimal solution for (LIN), then (h,X*,r*,t*) is 
an o 
rule 



an optimal solution for (IBILj) . where h = )o<Jt<t<JV ^ s defined by the following recursive 



'i Z^t=k+l IL k ' i.k \* I r> 

A ^ U , i = l,...,N, fc = (5.2) 

A* = 

Proof. As (LIN) is a relaxation of ( 1BILI) . it is enough to show that ( IBILj) can achieve the 
same objective value. Let (r*,X*,r*,t*) be an optimal solution for (LIN). If A* ^ for all 
1 < i < N, then (15. 2p satisfies all the equations in ( 15. ip and therefore (h, A*, r*, t*) is feasible 
for (IBILj) . 

Suppose A,^ = for some 1 < m < N and that m is the maximal index with this 
property. Then by the equality and non-negativity constraints in (LIN), we get that A^ = 
A2 = ■ ■ • = A^ = and Tq = r* = • • • = r^ t _ 1 = 0. Let S := S(r, A*, r*), then by the positive 
semidefinite constraint in (LIN), we have S >z 0. From the linear equalities connecting A 
and r it follows that 



Si. 



2 

1(\* 



2 



(A* + A*_ 1 + Tt 1 ) = A*, i = 2,...,JV 
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and we get that S^i = ■ ■ ■ = S m>m = 0. By the properties of positive semidefinite matrices 
we now get that r* k = for % = 1, . . . , m and k = 0, . . . , % — 1, hence the set of equations 
( 15. ip with the chosen values of K$ is consistent. □ 

The optimal value of (LIN) for various values of N is summarized in Figure El The 
resulting new algorithm with the computed optimal step sizes h^ 1 is illustrated for N = 5 
and given in Figure HI As can be seen from these results, (compare with Figure [2]) the 
performance of the new algorithm is almost exactly two times better than the performance 
of the fast gradient method. 



N 
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Figure 3: The value of the optimal value of (LIN) for various values of N. 



x 2 ^x l + ^/'(* ) + (xi) 

x 3 <- x 2 + °-^f'(x ) + -^f( Xl ) + ^/'(x 2 ) 

X 4 ^ X 3 + °-^f(x ) + ^Pf(x0 + °-^f(x 2 ) + ^P/'(*3) 

x 5 ir- x 4 + °-^f(x ) + ^Pf ( Xl ) + 2f*/' (x 2 ) + ^Pf (x a ) + ^f(i 4 ) 
Figure 4: A first-order algorithm with optimal step-sizes for N = 5. 
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A Proof of Lemma 



We now establish the positive definiteness of the matrices So and S± given in (13.81) and (13. 9p . 
respectively. 



A.l S yO 

We begin by showing that So is positive definite. Recall that 

/ 2Ai -Ax 
-Ax 2A2 — A2 

— A2 2A3 —A3 



So 



-Xn-i 2Atv —Xn 
-A,v 1 



for 



A, 



2JV+l-i' 

Let us look at x t Sqx for any x = (xq, . . . , xn) t 



i = l,...,N. 



N-l 



N-l 



X 



Sqx — 2Aj + ix^ — 2 Xi + \XiXi + i + x 2 N 



8=0 

JV-1 



i=0 



N-l 



^ X i+ i{x i+1 - Xif + AiXq + y^(Aj +1 - \i)xl + (1 - A 



n)x 2 n 



i=0 



which is always positive for i^O. We conclude that Sq is positive definite. 



A.2 Si^O 

We will show that Si is positive definite using Sylvester's criterion^. 



Recall that 



for 



5*i 



( 2Ax A2 — Ax 

A2 ~~ Ax 2A2 

Aat — Ayv_x Xn ~ Xn-1 

\ 1 — Atv 1 — Ajv 



A, 



2N+1 -1 



Xn — Xn-i 1 — Xn\ 
Xn — Xn-i 1 — X N 



2Xn 
1-X n 



i = l N. 



1-Ajv 
1 / 



5 Despite the interesting structure of the matrix Si, this proof is quite involved. A simpler proof would 
be most welcome! 
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A recursive expression for the determinants We begin by deriving a recursion rule 
for the determinant of matrices of the following form: 

\ 



M, 



' d 


a i 


a 2 ■ 






a\ 


d\ 


a 2 






a 2 


(>2 


d 2 




an 




Ofe-1 






dk 






a k 




dk 



To find the determinant of M k , subtract the one before last row multiplied by from 



the last row: the last row becomes 

(0, . . . , 0, a k 



ak 



4-1; 4 



Ofc-1 

Expanding the determinant along the last row we get 



■Ofc). 



det M k = (4 



Ofc-l 



a*.) detM fc . 



Ofc-i 



4-i) det(M k )k, 



fc-i 



where (M; 



k)k,k-l 



denotes the k,k — I minor: 



(M k ) k , 



fe-i 



/ 4 




a 2 




-2 




a! 


4 


a 2 


a k - 


-2 






a 2 


4 


a k - 


-2 










4- 


-2 




\Ofe_l 


Ofc-l 


Ofc-l 


a-k- 


-1 





If we multiply the last column of (Mk)k,k-i by we get a matrix that is different from 



Mk-i by only the corner element. Thus by basic determinant properties we get that 
Ofc-i 



a k 



det(M k ) k ,k-i = det M k -\ + (a k -i - 4-i) det M k - 2 . 



Combining these two results, we have found the following recursion rule for detil4, k > 2: 

a k 



det M k = (4 



Ofc-l 



-a k ) det M fc _ 



Ok_ 

ak-i 



4-i) 



fc ■ det M fe _i + (a fe — 4-i) det M fc _ 2 



(4 - 



Ok_ 

ak-i 



-a k - [a k 



a k~l 

a k 



Qfc-i 



d 



k-l 



Ok_ 
Ofc-1 



a k -i 
detM fc _i 



ak 



or 



det M fe = 4 



Ofc-l 



+ 



a fe 4-i 



*fc-i 



det M fe _! - a|( 1 - 



Qfc-i 

2 



4-i det M fc _ 2 



fc-i 



det M fe _ 2 . 



(A.l) 



Obviously, the recursion base cases are given by 

det Mq = do, 

det Mi = 44 — a\. 
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Closed form expressions for the determinants Going back to our matrix, Si, by 
choosing 

di = 2 \t 1 , i = 0,...,iV-l 
2N-i ' ' 

d N = 1 

ai = wh~ 2N + l-V i = ^-^~ 1 
iV 1 

a N = 1 



iv + i iv + r 

we get that M& is the k + l'th leading principal minor of the matrix Si. The recursion 
rule (IA.1I) can now be solved for this choice of a, and dj. The solution is given by: 



detMi. 



(2iV + l) 2 / A 2N-2k-l \ -A- 2N + ANi - 2i 2 + 1 



EZ,i V — — ± \ T — r ZriV "l-UVl — &l -|- ± 

o 2iV + 4JVi - 2z 2 + 1 J 11 (2iV + 1 -z) 2 ' 



for k = 0, . . . , N — 1, and 



, , r (2N + 1) 2 ^ 2N + 4Ni-2i 2 + 1 /A oN 

det M N = det L, FT (2jy + j _ - )a • (A.3) 



Verification We now proceed to verify the expressions (1A.2I) and (1A.3[) given above. We 



will show that these expressions satisfy the recursion rule (lA.ip and the base cases of the 
problem. We begin by verifying the base cases: 

, (2N+1) 2 / 2JV- 1\ 1 1 

detMo = W^l 1 + ^TTj^TT = ^ = rfo ' 



(2N+1) 2 / 2N-3 2iV-3\ 1 6JV - 1 
det Mi = ^— — 1 1 



(2A^ - l) 2 V 2N+1 QN-lJ 2N + 1 (2N) 2 
28N 2 - 20N - 1 4 / 2 1 



AN 2 (2N-1) 2 N(2N-1) \2N-1 2N 
Now suppose 2 < k < N. Denote 

j 2a? a|4_i 
a/c = «fc 1 2 

Ofc-l a k-l 

(2iV+l)fc-fc 2 -l i -\j 

(2iV-/c) 2 ' 

, 2iV 2 +2jV~l j, _ at 

' (2JV+1) 2 ' K - iV 



_ 4-i 

ajfc-i 

(4fcjV-2jV-2fc 2 +4fc-l) 2 
(2Af-fc) 2 (2V-fc+l) 2 ' 
(2jV 2 +2jV-l) 2 
(7V+l) 2 (2iV+l) 2 ' 



k < N 
k = N 



2 



dodi — a x 
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then the recursion rule (IA.10 can be written as 

det M k = a k det M k _ x - k det M k _ 2 . 



Further denote 



(2N + 1) 2 

9i = 2N -2i-l, i = 0,...,N -1, 

Xj = - , i — 0, . . . , N — 1, 

1 2N + ANi — 2i 2 + 1 ' ' 

2N + ANi -2i 2 + l 

Vi= (2JV+1-0 2 ' * = 0,...,iV-l, 



then the solution ( 1A.2I) becomes 



det M fc = f k 1 + ^ Xj J ]~[ y i; 



and (1A.3|) becomes 



i=o / i=0 



JV-l 



(2A^+ 1)' 

i=0 

Substituting f lA~2l) in the RHS of flA~ll we get that for k = 2, . . . , N 



detMN = \ N+1 )2 



a k det M fc _i - det M fc _ 2 

(fe-l \ fc-1 / fe-2 \ fe-2 

i + ^ ) n m ~ Pkfk~2 1 1 + 9k-2 ^2 Xi ) n ^ 
i=0 / i=0 \ i=0 / i=0 

CKfc/jfc-l 1 + g k -lX k -l + 9k-l ^2 Xi ) — fk-2 —fk-29k-2 ^2 Xi 



i= o / Vk ~ l Vk - 1 i=0 



«fc/fc-i(l + gk-iXk-i) —fk-2 + ( a k f k -ig k -i — fk-2gk-2 ) 



fc-2 

.r 

2=0 



It is straightforward (although somewhat involved) to verify that for k < N 

a fe / fc _i(l + fi-fc-i^fc-i) —fk-2 = fkUk(l + gkXk-i + #fc£fc), 

2/fc-i 



and 



a kJk-igk-l Jk-29k-2 — Jk9kUk- 

Vk-1 
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We therefore get 



a k det M fc _i - (3 k det M k _ 2 

fkUk(l + Qk^k-i + 9kXk) + fk9kVk ^2 Xi II Vi 



k-2 



fc-1 



i=0 



i=0 



= fk ( i + 9k Xi ) n yi 

= detM fc , 

and thus (1A.2[) satisfies (lA.ip . It is also possible to show that 

ajv/jV-l(l + #7V-l£Ar_i) - 7 Jat-2 " 



(7V + 1) 



2 ' 



Q!jv/iV-lfl , JV- 



/3 



Jn-29n- 



thus, for fc = iV 



o/v det Mjv_! — /3jy det M N _ 2 



N-l 



(N + 
detM N 



i=0 



and the expression (IA.3I) is also verified. 

To complete the proof, note that the closed form expressions for det M k consist of sums 
and products of positive values, hence det M k is positive, and thus follows from Sylvester's 
criterion that Si is positive definite. 
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